Lesson 6

Welcome

Notes: We are going to explore the diamonds dataset so that we can predict the diamond prices. This requires linear regression models.


Scatterplot Review

library('ggplot2')
data(diamonds)

ggplot(aes(x=carat, y=price), data= diamonds) + geom_point() + scale_x_continuous(lim = c(0, quantile(diamonds$carat, 0.99))) + scale_y_continuous(lim = c(0, quantile(diamonds$price, 0.99)))
## Warning: Removed 926 rows containing missing values (geom_point).


Price and Carat Relationship

Response: theres a lot of diamonds that have 0.5 increments. maybe the carats are cut in 0.5 increments?

From the video: there is an exponential relationship between price and carat. If you add a linear model line (lm) it will show that the scatterplot deviates from the linear line.

ggplot(aes(x=carat, y=price), data= diamonds) + 
  geom_point(color = '#F79420', alpha = 1/4) + 
  stat_smooth(method = 'lm') +
  scale_x_continuous(lim = c(0, quantile(diamonds$carat, 0.99))) +
  scale_y_continuous(lim = c(0, quantile(diamonds$price, 0.99)))
## Warning: Removed 926 rows containing non-finite values (stat_smooth).
## Warning: Removed 926 rows containing missing values (geom_point).
## Warning: Removed 4 rows containing missing values (geom_smooth).

***

Frances Gerety

Notes: DeBeers cartel was formed in 1888 to control the prices of diamonds. During WWI and the Great Depression, prices of diamonds plummeted and nobody wanted to purchase diamonds. But then in 1938, Frances Gerety, an employee of an advertising agency called N.W. Ayer & Son, took on the De Beers diamonds and quoted a slogan of the century…

A diamonds is

Forever


The Rise of Diamonds

Notes: This gave birth to modern demand advertising. The objective was to increase the sentimental value of diamonds rather than brand strengthening or increasing demand. Celebrities, kentucky derby, grammy awards, and even the royal families wore diamonds. 2 months salary spent on diamonds, something that lasted forever, was seen of value. Today, everybody who gets engaged will get a diamond.


ggpairs Function

Notes:

# install these if necessary
# install.packages('GGally')
# install.packages('scales')
# install.packages('memisc')
# install.packages('lattice')
# install.packages('MASS')
# install.packages('car')
# install.packages('reshape')
# install.packages('plyr')

# load the ggplot graphics package and the others
library(ggplot2)
library(GGally)
library(scales)
library(memisc)
## Loading required package: lattice
## Loading required package: MASS
## 
## Attaching package: 'memisc'
## The following object is masked from 'package:scales':
## 
##     percent
## The following object is masked from 'package:ggplot2':
## 
##     syms
## The following objects are masked from 'package:stats':
## 
##     contr.sum, contr.treatment, contrasts
## The following object is masked from 'package:base':
## 
##     as.array
# sample 10,000 diamonds from the data set
set.seed(20022012)
diamond_samp <- diamonds[sample(1:length(diamonds$price), 10000), ]
ggpairs(diamond_samp,
  lower = list(continuous = wrap("points", shape = I('.'))),
  upper = list(combo = wrap("box", outlier.shape = I('.'))))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

What are some things you notice in the ggpairs output? Response: the graph follows a semi-linear line. it looks almost as if its logarithmic


The Demand of Diamonds

Notes: Looking back at the scatterplot between price and carat, those with lower cararts have a higher saturation of points because buyers care about the price. As you get to the bigger carats, price flucutates wildly because the buyer usually has more money and does not care for the price. They care for the sentiments of the diamonds.

# Create two histograms of the price variable
# and place them side by side on one output image.
# We’ve put some code below to get you started.
# The first plot should be a histogram of price
# and the second plot should transform
# the price variable using log10.
# Set appropriate bin widths for each plot.
# ggtitle() will add a title to each histogram.
# You can self-assess your work with the plots
# in the solution video.
library(gridExtra)
library(grid)
plot1 <- qplot(data= diamonds, x= price, fill= I('#099DD9'), binwidth = 100) + 
  ggtitle('Price')
plot2 <- qplot(data= diamonds, x= log10(price), fill= I('#F79420'), binwidth = 0.01) +
  ggtitle('Price (log10)')
grid.arrange(plot1, plot2, ncol = 2)


Connecting Demand and Price Distributions

Notes: a lot of demand for diamonds at price 1000 and 10000 are high. This bimodality, as seen on the log10(price) graph above, shows the rich-poor class. There is a normal distribution for each class. ***

Scatterplot Transformation

qplot(carat, price, data = diamonds) +
  scale_y_continuous(trans=log10_trans() ) + 
  ggtitle('Price (log10) by Carat')

Create a new function to transform the carat variable

cuberoot_trans = function(){
  trans_new('cuberoot',
            transform = function(x) {x^(1/3)},
            inverse = function(x){ x^3})
  }

Use the cuberoot_trans function

ggplot(aes(carat, price), data = diamonds) + 
  geom_point() + 
  scale_x_continuous(trans = cuberoot_trans(), limits = c(0.2, 3),
                     breaks = c(0.2, 0.5, 1, 2, 3)) + 
  scale_y_continuous(trans = log10_trans(), limits = c(350, 15000),
                     breaks = c(350, 1000, 5000, 10000, 15000)) +
  ggtitle('Price (log10) by Cube-Root of Carat')
## Warning: Removed 1683 rows containing missing values (geom_point).


Overplotting Revisited

head(sort(table(diamonds$carat), decreasing = T))
## 
##  0.3 0.31 1.01  0.7 0.32    1 
## 2604 2249 2242 1981 1840 1558
head(sort(table(diamonds$price), decreasing = T))
## 
## 605 802 625 828 776 698 
## 132 127 126 125 124 121
ggplot(aes(carat, price), data = diamonds) + 
  geom_point(alpha=0.5, position = 'jitter', size = 3/4) + 
  scale_x_continuous(trans = cuberoot_trans(), limits = c(0.2, 3),
                     breaks = c(0.2, 0.5, 1, 2, 3)) + 
  scale_y_continuous(trans = log10_trans(), limits = c(350, 15000),
                     breaks = c(350, 1000, 5000, 10000, 15000)) +
  ggtitle('Price (log10) by Cube-Root of Carat')
## Warning: Removed 1691 rows containing missing values (geom_point).


Other Qualitative Factors

Notes:


Price vs. Carat and Clarity

Alter the code below.

# install and load the RColorBrewer package
# install.packages('RColorBrewer', dependencies = True)
library(ggplot2)
library(RColorBrewer)

names(diamonds)
##  [1] "carat"   "cut"     "color"   "clarity" "depth"   "table"   "price"  
##  [8] "x"       "y"       "z"
ggplot(aes(x = carat, y = price, color = clarity), data = diamonds) + geom_point(alpha = 0.5, size = 1, position = 'jitter') +
  scale_color_brewer(type = 'div',
    guide = guide_legend(title = 'Clarity', reverse = T,
    override.aes = list(alpha = 1, size = 2))) +  
  scale_x_continuous(trans = cuberoot_trans(), limits = c(0.2, 3),
    breaks = c(0.2, 0.5, 1, 2, 3)) + 
  scale_y_continuous(trans = log10_trans(), limits = c(350, 15000),
    breaks = c(350, 1000, 5000, 10000, 15000)) +
  ggtitle('Price (log10) by Cube-Root of Carat and Clarity')
## Warning: Removed 1693 rows containing missing values (geom_point).


Clarity and Price

Response: yes it does because less imperfections, the better the diamond will look.


Price vs. Carat and Cut

Alter the code below.

ggplot(aes(x = carat, y = price, color = cut), data = diamonds) + 
  geom_point(alpha = 0.5, size = 1, position = 'jitter') +
  scale_color_brewer(type = 'div',
    guide = guide_legend(title = 'Cut', 
              reverse = T, 
              override.aes = list(alpha = 1,size = 2))) +
  scale_x_continuous(trans = cuberoot_trans(), limits = c(0.2, 3),
                     breaks = c(0.2, 0.5, 1, 2, 3)) + 
  scale_y_continuous(trans = log10_trans(), limits = c(350, 15000),
                     breaks = c(350, 1000, 5000, 10000, 15000)) +
  ggtitle('Price (log10) by Cube-Root of Carat and Cut')
## Warning: Removed 1696 rows containing missing values (geom_point).


Cut and Price

Response: No, the majority of cases are ideal cut so it does not matter whether cut affects price. We lost the color pattern we saw before with clarity and price.


Price vs. Carat and Color

Alter the code below.

ggplot(aes(x = carat, y = price, color = color), data = diamonds) + 
  geom_point(alpha = 0.5, size = 1, position = 'jitter') +
  scale_color_brewer(type = 'div',
                     guide = guide_legend(title = 'Color', reverse = F, override.aes = list(alpha = 1, size = 2))) +  
  scale_x_continuous(trans = cuberoot_trans(), limits = c(0.2, 3),
                     breaks = c(0.2, 0.5, 1, 2, 3)) + 
  scale_y_continuous(trans = log10_trans(), limits = c(350, 15000),
                     breaks = c(350, 1000, 5000, 10000, 15000)) +
  ggtitle('Price (log10) by Cube-Root of Carat and Color')
## Warning: Removed 1688 rows containing missing values (geom_point).


Color and Price

Response: yes, there is a variety of colors on the graph and that there is a direct relationship between color and price ***

Linear Models in R

Notes: predict using the lm() function! lm(y~x) where y is the outcome variable and x is the explanatory variable. Which of the formulas would we use inside the lm() function?

Response: log(price) ~ carat^(1/3) ***

Building the Linear Model

Notes:

m1 <- lm(I(log(price)) ~ I(carat^(1/3)), data = diamonds)
m2 <- update(m1, ~ . + carat)
m3 <- update(m2, ~ . + cut)
m4 <- update(m3, ~ . + color)
m5 <- update(m4, ~ . + clarity)
mtable(m1, m2, m3, m4, m5)
## 
## Calls:
## m1: lm(formula = I(log(price)) ~ I(carat^(1/3)), data = diamonds)
## m2: lm(formula = I(log(price)) ~ I(carat^(1/3)) + carat, data = diamonds)
## m3: lm(formula = I(log(price)) ~ I(carat^(1/3)) + carat + cut, data = diamonds)
## m4: lm(formula = I(log(price)) ~ I(carat^(1/3)) + carat + cut + color, 
##     data = diamonds)
## m5: lm(formula = I(log(price)) ~ I(carat^(1/3)) + carat + cut + color + 
##     clarity, data = diamonds)
## 
## ============================================================================================
##                        m1             m2             m3             m4            m5        
## --------------------------------------------------------------------------------------------
##   (Intercept)          2.821***       1.039***       0.874***      0.932***       0.415***  
##                       (0.006)        (0.019)        (0.019)       (0.017)        (0.010)    
##   I(carat^(1/3))       5.558***       8.568***       8.703***      8.438***       9.144***  
##                       (0.007)        (0.032)        (0.031)       (0.028)        (0.016)    
##   carat                              -1.137***      -1.163***     -0.992***      -1.093***  
##                                      (0.012)        (0.011)       (0.010)        (0.006)    
##   cut: .L                                            0.224***      0.224***       0.120***  
##                                                     (0.004)       (0.004)        (0.002)    
##   cut: .Q                                           -0.062***     -0.062***      -0.031***  
##                                                     (0.004)       (0.003)        (0.002)    
##   cut: .C                                            0.051***      0.052***       0.014***  
##                                                     (0.003)       (0.003)        (0.002)    
##   cut: ^4                                            0.018***      0.018***      -0.002     
##                                                     (0.003)       (0.002)        (0.001)    
##   color: .L                                                       -0.373***      -0.441***  
##                                                                   (0.003)        (0.002)    
##   color: .Q                                                       -0.129***      -0.093***  
##                                                                   (0.003)        (0.002)    
##   color: .C                                                        0.001         -0.013***  
##                                                                   (0.003)        (0.002)    
##   color: ^4                                                        0.029***       0.012***  
##                                                                   (0.003)        (0.002)    
##   color: ^5                                                       -0.016***      -0.003*    
##                                                                   (0.003)        (0.001)    
##   color: ^6                                                       -0.023***       0.001     
##                                                                   (0.002)        (0.001)    
##   clarity: .L                                                                     0.907***  
##                                                                                  (0.003)    
##   clarity: .Q                                                                    -0.240***  
##                                                                                  (0.003)    
##   clarity: .C                                                                     0.131***  
##                                                                                  (0.003)    
##   clarity: ^4                                                                    -0.063***  
##                                                                                  (0.002)    
##   clarity: ^5                                                                     0.026***  
##                                                                                  (0.002)    
##   clarity: ^6                                                                    -0.002     
##                                                                                  (0.002)    
##   clarity: ^7                                                                     0.032***  
##                                                                                  (0.001)    
## --------------------------------------------------------------------------------------------
##   R-squared            0.924          0.935          0.939         0.951          0.984     
##   adj. R-squared       0.924          0.935          0.939         0.951          0.984     
##   sigma                0.280          0.259          0.250         0.224          0.129     
##   F               652012.063     387489.366     138654.523     87959.467     173791.084     
##   p                    0.000          0.000          0.000         0.000          0.000     
##   Log-likelihood   -7962.499      -3631.319      -1837.416      4235.240      34091.272     
##   Deviance          4242.831       3613.360       3380.837      2699.212        892.214     
##   AIC              15930.999       7270.637       3690.832     -8442.481     -68140.544     
##   BIC              15957.685       7306.220       3761.997     -8317.942     -67953.736     
##   N                53940          53940          53940         53940          53940         
## ============================================================================================

Notice how adding cut to our model does not help explain much of the variance in the price of diamonds. This fits with our exploration earlier.


Model Problems

Video Notes: our model is.. ln(price) = 0.415 + 9.144 * carat^(1/3) - 1.093Carat + (…cut+…color+…clarity) + E

what should be some problems when using this model? What else should we think about when using this model?

Research: (Take some time to come up with 2-4 problems for the model) (You should 10-20 min on this)

Response: this data set is only from 2008-2014 so we need to account for inflation and the diamond market is a lot different it is today than from 2008. There was a 2008 recession that lowered diamond market as well. Diamond prices were unevenly adjusted based on carat weight so the model that we found could simply be adjusted for inflation. ***

A Bigger, Better Data Set

Notes:

# install.packages('bitops')
# install.packages('RCurl')
library('bitops')
library('RCurl')

diamondsurl = getBinaryURL("https://raw.github.com/solomonm/diamonds-data/master/BigDiamonds.Rda")
load('BigDiamonds.Rda')

The code used to obtain the data is available here: https://github.com/solomonm/diamonds-data

Building a Model Using the Big Diamonds Data Set

Notes:

diamondsbig$logprice <- log(diamondsbig$price)
m1 <- lm(logprice ~ I(carat^(1/3)), 
         data = diamondsbig[diamondsbig$price < 10000 &
                              diamondsbig$cert == 'GIA',])
m2 <- update(m1, ~ . + carat)
m3 <- update(m2, ~ . + cut)
m4 <- update(m3, ~ . + color)
m5 <- update(m4, ~ . + clarity)
mtable(m1, m2, m3, m4, m5)
## 
## Calls:
## m1: lm(formula = logprice ~ I(carat^(1/3)), data = diamondsbig[diamondsbig$price < 
##     10000 & diamondsbig$cert == "GIA", ])
## m2: lm(formula = logprice ~ I(carat^(1/3)) + carat, data = diamondsbig[diamondsbig$price < 
##     10000 & diamondsbig$cert == "GIA", ])
## m3: lm(formula = logprice ~ I(carat^(1/3)) + carat + cut, data = diamondsbig[diamondsbig$price < 
##     10000 & diamondsbig$cert == "GIA", ])
## m4: lm(formula = logprice ~ I(carat^(1/3)) + carat + cut + color, 
##     data = diamondsbig[diamondsbig$price < 10000 & diamondsbig$cert == 
##         "GIA", ])
## m5: lm(formula = logprice ~ I(carat^(1/3)) + carat + cut + color + 
##     clarity, data = diamondsbig[diamondsbig$price < 10000 & diamondsbig$cert == 
##     "GIA", ])
## 
## ================================================================================================
##                         m1              m2             m3             m4              m5        
## ------------------------------------------------------------------------------------------------
##   (Intercept)           2.671***        1.333***       0.949***       0.529***       -0.464***  
##                        (0.003)         (0.012)        (0.012)        (0.010)         (0.009)    
##   I(carat^(1/3))        5.839***        8.243***       8.633***       8.110***        8.320***  
##                        (0.004)         (0.022)        (0.021)        (0.017)         (0.012)    
##   carat                                -1.061***      -1.223***      -0.782***       -0.763***  
##                                        (0.009)        (0.009)        (0.007)         (0.005)    
##   cut: V.Good                                          0.120***       0.090***        0.071***  
##                                                       (0.002)        (0.001)         (0.001)    
##   cut: Ideal                                           0.211***       0.181***        0.131***  
##                                                       (0.002)        (0.001)         (0.001)    
##   color: K/L                                                          0.123***        0.117***  
##                                                                      (0.004)         (0.003)    
##   color: J/L                                                          0.312***        0.318***  
##                                                                      (0.003)         (0.002)    
##   color: I/L                                                          0.451***        0.469***  
##                                                                      (0.003)         (0.002)    
##   color: H/L                                                          0.569***        0.602***  
##                                                                      (0.003)         (0.002)    
##   color: G/L                                                          0.633***        0.665***  
##                                                                      (0.003)         (0.002)    
##   color: F/L                                                          0.687***        0.723***  
##                                                                      (0.003)         (0.002)    
##   color: E/L                                                          0.729***        0.756***  
##                                                                      (0.003)         (0.002)    
##   color: D/L                                                          0.812***        0.827***  
##                                                                      (0.003)         (0.002)    
##   clarity: I1                                                                         0.301***  
##                                                                                      (0.006)    
##   clarity: SI2                                                                        0.607***  
##                                                                                      (0.006)    
##   clarity: SI1                                                                        0.727***  
##                                                                                      (0.006)    
##   clarity: VS2                                                                        0.836***  
##                                                                                      (0.006)    
##   clarity: VS1                                                                        0.891***  
##                                                                                      (0.006)    
##   clarity: VVS2                                                                       0.935***  
##                                                                                      (0.006)    
##   clarity: VVS1                                                                       0.995***  
##                                                                                      (0.006)    
##   clarity: IF                                                                         1.052***  
##                                                                                      (0.006)    
## ------------------------------------------------------------------------------------------------
##   R-squared             0.888           0.892          0.899          0.937           0.969     
##   adj. R-squared        0.888           0.892          0.899          0.937           0.969     
##   sigma                 0.289           0.284          0.275          0.216           0.154     
##   F               2700903.714     1406538.330     754405.425     423311.488      521161.443     
##   p                     0.000           0.000          0.000          0.000           0.000     
##   Log-likelihood   -60137.791      -53996.269     -43339.818      37830.414      154124.270     
##   Deviance          28298.689       27291.534      25628.285      15874.910        7992.720     
##   AIC              120281.582      108000.539      86691.636     -75632.827     -308204.540     
##   BIC              120313.783      108043.473      86756.037     -75482.557     -307968.400     
##   N                338946          338946         338946         338946          338946         
## ================================================================================================

Predictions

Example Diamond from BlueNile: Round 1.00 Very Good I VS1 $5,601

#Be sure you’ve loaded the library memisc and have m5 saved as an object in your workspace.
thisDiamond = data.frame(carat = 1.00, cut = "V.Good",
                         color = "I", clarity="VS1")
modelEstimate = predict(m5, newdata = thisDiamond,
                        interval="prediction", level = 0.95)

Evaluate how well the model predicts the BlueNile diamond’s price. Think about the fitted point estimate as well as the 95% CI.

The fit falls within the 95% confidence interval so it is a good fit


Final Thoughts

Notes: You can always use the four C’s in finding the right diamond. It does not matter where you buy your diamond, they will surely be the same price everywhere.


Click KnitHTML to see all of your hard work and to have an html page of this lesson, your answers, and your notes!